December 4, 2017

Scenario: A company approaches you to predict data scientist salaries with machine learning.

Let's predict data scientist salaries

What is Machine Learning?

Machine learning is a method for teaching computers to make and improve predictions or behaviours based on data.
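In the smallest possible terms: show the computer examples, fit a model, and use the model to predict unseen cases. A toy sketch in base R on synthetic salary data (all numbers are made up for illustration):

```r
# Toy supervised learning: fit a linear model on synthetic salary data
set.seed(1)
experience = runif(100, 0, 20)                        # years of experience
salary = 40000 + 3000 * experience + rnorm(100, sd = 5000)
model = lm(salary ~ experience)                       # "learn" from the data
pred = predict(model, newdata = data.frame(experience = 10))
# The model recovers the underlying pattern and can predict new cases
```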

Step 1: Find some data

Kaggle conducted an industry-wide survey of data scientists. https://www.kaggle.com/kaggle/kaggle-survey-2017

Information asked:

  • Compensation
  • Demographics
  • Job title
  • Experience

Contains information from Kaggle ML and Data Science Survey, 2017, which is made available here under the Open Database License (ODbL).

Step 2: Throw ML on your data

library('mlr')
set.seed(42)
# Regression task: predict compensation from the remaining survey columns
task = makeRegrTask(data = survey.dat, target = 'CompensationAmount')
# Random forest learner; importance = TRUE stores feature importances for later
lrn = makeLearner('regr.randomForest', importance = TRUE)
mod = train(lrn, task)

Step 3: Profit.

"There is a problem with the model!"

What problem?

"The older the applicants, the higher the predicted salary, regardless of skills."

Individual Conditional Expectation

ice = generatePartialDependenceData(mod, task, features = 'Age',
                                    individual = TRUE)
plotPartialDependence(ice) + scale_y_continuous(limits = c(0, NA))

Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics, 24(1), 44–65. https://doi.org/10.1080/10618600.2014.907095
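Under the hood an ICE curve is simple: take one observation, sweep the feature of interest over a grid while holding its other features fixed, and record the model's predictions. A base-R sketch on synthetic data (the data and the lm model are illustrative assumptions, not the survey model):

```r
# ICE by hand: one predicted curve per observation
set.seed(42)
n = 50
d = data.frame(Age = runif(n, 20, 60), Skill = runif(n, 0, 10))
d$Salary = 30000 + 800 * d$Age + 2000 * d$Skill + rnorm(n, sd = 3000)
fit = lm(Salary ~ Age + Skill, data = d)

grid = seq(20, 60, by = 5)
ice = sapply(grid, function(a) {
  tmp = d
  tmp$Age = a          # set Age to the grid value for every row
  predict(fit, tmp)    # other features stay at their observed values
})                     # n x length(grid): row i is observation i's ICE curve

# Centered ICE: anchor every curve at the first grid point (Age = 20),
# so all curves start at 0 and differences in shape become comparable
ice.c = ice - ice[, 1]
```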

ice.c = generatePartialDependenceData(mod, task, features = 'Age',
          individual = TRUE, center = list(Age = 20))
plotPartialDependence(ice.c)

Partial dependence plots

pdp = generatePartialDependenceData(mod, task, features = 'Age')
plotPartialDependence(pdp) + scale_y_continuous(limits = c(0, NA))

Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.2307/2699986
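A partial dependence curve is just the average of the ICE curves: for each grid value, set every observation's feature to that value, predict, and average. A minimal base-R sketch on synthetic data (illustrative assumptions, not the survey model):

```r
# Partial dependence by hand: average prediction at each grid value
set.seed(42)
n = 50
d = data.frame(Age = runif(n, 20, 60), Skill = runif(n, 0, 10))
d$Salary = 30000 + 800 * d$Age + 2000 * d$Skill + rnorm(n, sd = 3000)
fit = lm(Salary ~ Age + Skill, data = d)

grid = seq(20, 60, by = 5)
pdp = sapply(grid, function(a) {
  tmp = d
  tmp$Age = a
  mean(predict(fit, tmp))   # average over the observed Skill values
})
```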

"We want to understand the model better!"

Permutation feature importance

library(tidyr)    # gather()
library(dplyr)    # %>% and arrange()
library(ggplot2)

feat.imp = getFeatureImportance(mod, type = 1)$res
dat = gather(feat.imp, key = 'Feature', value = 'Importance') %>%
  arrange(Importance)
dat$Feature = factor(dat$Feature, levels = dat$Feature)  # keep sorted order
ggplot(dat) + geom_point(aes(y = Feature, x = Importance))

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
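Permutation feature importance can also be computed by hand: shuffle one feature, breaking its association with the target, and measure how much the model's error grows. A base-R sketch on synthetic data (illustrative, not the survey model):

```r
# Permutation importance by hand: error increase after shuffling a feature
set.seed(42)
n = 200
d = data.frame(Age = runif(n, 20, 60), Noise = runif(n))
d$Salary = 30000 + 800 * d$Age + rnorm(n, sd = 3000)
fit = lm(Salary ~ Age + Noise, data = d)
mse = function(x) mean((x$Salary - predict(fit, x))^2)

base.err = mse(d)
importance = sapply(c('Age', 'Noise'), function(f) {
  perm = d
  perm[[f]] = sample(perm[[f]])  # destroy the feature-target association
  mse(perm) - base.err           # error increase = importance
})
# Age matters, Noise does not
```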

Gender?!

pdp = generatePartialDependenceData(mod, task, features = 'Gender')
ggplot(pdp$data) + geom_point(aes(x = Gender, y = CompensationAmount)) +
  geom_segment(aes(x = Gender, xend = Gender, yend = CompensationAmount), y = 0) +
  scale_y_continuous(limits = c(0, NA)) +
  theme(axis.text.x = element_text(angle = 10, hjust = 1))

LIME

library(lime)
# Build the explainer on the training data, not the importance table in `dat`
explainer = lime(survey.dat, mod)
# Explain one respondent's prediction with a 3-feature local model
explanation = lime::explain(survey.dat[3, ], explainer, n_features = 3)
plot_features(explanation, ncol = 1)

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. Retrieved from http://arxiv.org/abs/1602.04938
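The idea behind LIME in a nutshell: sample perturbed points around the instance to explain, weight them by proximity, and fit a simple interpretable model to the black box's predictions on those points. A one-feature base-R sketch with a made-up black box (illustrative assumptions throughout):

```r
# LIME-style local surrogate around a single instance
set.seed(42)
blackbox = function(age) 30000 + 2000 * sqrt(age)   # stand-in nonlinear model

x0 = 40                                             # instance to explain
age = pmax(rnorm(500, mean = x0, sd = 10), 1)       # perturbed samples (age > 0)
pred = blackbox(age)                                # query the black box
w = exp(-(age - x0)^2 / (2 * 5^2))                  # proximity kernel

surrogate = lm(pred ~ age, weights = w)             # weighted linear surrogate
# The surrogate's slope approximates the black box locally:
# d/dAge of 2000*sqrt(Age) at Age = 40 is 1000/sqrt(40), about 158
coef(surrogate)['age']
```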

Interested in learning more?